
PLOS Digital Health

Public Library of Science (PLoS)

Preprints posted in the last 7 days, ranked by how well they match PLOS Digital Health's content profile, based on 91 papers previously published here. The average preprint has a 0.11% match score for this journal, so anything above that is already an above-average fit.

1
Recovering Clinical Detail in AI-Generated Responses for Low Back Pain Through Prompt Design

Basharat, A.; Hamza, O.; Rana, P.; Odonkor, C. A.; Chow, R.

2026-04-23 pain medicine 10.64898/2026.04.21.26351437 medRxiv
Top 0.1%
22.9%

Introduction Large language models are increasingly being used in healthcare. In interventional pain medicine, clinical reasoning is essential for procedural planning. Prior studies show that simplified prompts reduce clinical detail in AI-generated responses. It remains unclear whether this reflects knowledge loss or simply prompt-driven suppression of information. Methods We performed a controlled comparative study using 15 standardized low back pain questions representing common interventional pain questions. Each question was submitted to ChatGPT under three conditions: a professional-level prompt (DP), a fourth-grade reading-level prompt (D4), and clinician-directed rewriting of the D4 response to a medical level (U4→MD). No follow-up prompting was allowed. Three physicians independently rated responses for accuracy using a 0-2 ordinal scale. Clinical completeness was determined by consensus. Word count and Flesch-Kincaid Grade Level (FKGL) were also measured. Paired t-tests compared conditions. Results Accuracy was highest with professional prompting (1.76). Accuracy declined with the fourth-grade prompt (1.33; p = 0.00086). When simplified responses were rewritten for clinicians, accuracy returned to baseline (1.76; p ≈ 1.00 vs DP). Clinical completeness followed the same pattern: DP 80.0%, D4 6.7%, U4→MD 73.3%. Fourth-grade responses were shorter and less complex. Upscaled responses were more complex and similar in length to professional responses. Inter-rater reliability was low (Fleiss κ = 0.17), but trends were consistent across conditions. Conclusions Reduced clinical detail under simplified prompts appears to reflect constrained output rather than loss of knowledge. Clinician-directed reframing restores omitted content. LLM performance in interventional pain depends strongly on prompt design and intended audience.
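The condition comparison above is a standard paired test. A minimal sketch, assuming hypothetical per-question mean ratings (the study's raw scores are not published here; only the reported means 1.76 and 1.33 are used to parameterize the simulation):

```python
# Paired t-test across prompt conditions, as described in the Methods.
# The per-question ratings below are simulated stand-ins, not study data.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
n_questions = 15
dp = np.clip(rng.normal(1.76, 0.30, n_questions), 0, 2)  # professional prompt (DP)
d4 = np.clip(rng.normal(1.33, 0.40, n_questions), 0, 2)  # fourth-grade prompt (D4)

t, p = ttest_rel(dp, d4)  # paired, since the same 15 questions appear in both arms
print(f"DP vs D4 accuracy: t = {t:.2f}, p = {p:.4g}")
```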

2
Comparison of foundation models and transfer learning strategies for diabetic retinopathy classification

Li, L. Y.; Lebiecka-Johansen, B.; Byberg, S.; Thambawita, V.; Hulman, A.

2026-04-20 health informatics 10.64898/2026.04.17.26351092 medRxiv
Top 0.1%
22.5%

Diabetic retinopathy (DR) is a leading cause of vision impairment, requiring accurate and scalable diagnostic tools. Foundation models are increasingly applied to clinical imaging, but concerns remain about their calibration. We evaluated DINOv3, RETFound, and VisionFM for DR classification using different transfer learning strategies in BRSET (n = 16,266) and mBRSET (n = 5,164). Models achieved high discrimination in binary classification (normal vs retinopathy) in BRSET (AUROC 0.90-0.98), with DINOv3 achieving the best under full fine-tuning (AUROC 0.98 [95% CI: 0.97-0.99]). External validation on mBRSET showed decreased performance for all models regardless of the fine-tuning strategy (AUROC 0.70-0.85), though fine-tuning improved performance. Foundation models achieved strong discrimination but poor calibration, generally overestimating DR risk. While the generalist model, DINOv3, benefited from deeper fine-tuning, miscalibration remained evident. These findings underscore the need to improve calibration and the comprehensive evaluation of foundation models, which are essential in clinical settings. Author summary: Artificial intelligence is increasingly being used to detect eye diseases such as diabetic retinopathy from retinal images. Recent advances have introduced "foundation models," which are trained on large datasets and can be adapted to new tasks. We aimed to evaluate how well these models perform in a clinical prediction context, with a focus not only on accuracy but also on how reliably they estimate disease risk. In this study, we compared different types of foundation models using two independent datasets from Brazil. We found that while these models were generally good at distinguishing between healthy and diseased eyes, their predicted risks were often poorly calibrated. In other words, the estimated probabilities did not consistently reflect the true likelihood of disease. We also examined whether adapting the models to the target population could improve performance. Although this approach led to improvements, calibration issues remained. However, post-training correction improved the agreement between predicted risks and observed outcomes. Our findings highlight an important gap between model performance and clinical usefulness. We suggest that improving the reliability of risk estimates is essential before such systems can be safely used in healthcare.
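The paper's central distinction (strong discrimination, poor calibration) can be checked with two standard metrics plus a reliability curve. A minimal sketch on simulated probabilities that systematically overestimate risk, in the spirit of the finding above; the sklearn functions are real, the data are fabricated:

```python
# Discrimination (AUROC) vs calibration (Brier score, reliability bins)
# on synthetic, deliberately risk-overestimating predictions.
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(1)
y_true = rng.binomial(1, 0.3, 2000)
y_prob = np.clip(y_true * 0.7 + rng.normal(0.25, 0.15, 2000), 0, 1)  # inflated risks

print("AUROC:", round(roc_auc_score(y_true, y_prob), 3))   # high: good ranking
print("Brier:", round(brier_score_loss(y_true, y_prob), 3))
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)
for mp, fp in zip(prob_pred, prob_true):                    # predicted vs observed
    print(f"predicted {mp:.2f} -> observed {fp:.2f}")
```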

3
Stakeholder perspectives on the use of enhanced mobile phone capabilities for public health surveillance for non-communicable disease risk factors: A qualitative study

Mwaka, E. S.; Nabukenya, S.; Kasiita, V.; Bagenda, G.; Rutebemberwa, E.; Ali, J.; Gibson, D.

2026-04-23 health informatics 10.64898/2026.04.22.26351443 medRxiv
Top 0.1%
17.3%

Background: Mobile phone-based tools are increasingly used to collect data on non-communicable disease (NCD) risk factors, particularly in low-resource settings where traditional data collection systems face operational and infrastructural constraints. This study examined stakeholder perspectives on the use of enhanced mobile phone-based capabilities to support the collection of public health surveillance data on NCD risk factors in low-resource settings. Methods: An exploratory qualitative study was conducted between November 2022 and July 2023. Twenty in-depth interviews were conducted with public health specialists, ethicists, NCD researchers, health informaticians, and policy makers in Uganda. Thematic analysis was used to interpret the results. Results: Four themes emerged from the data, including benefits of using mobile phone capabilities for NCD risk factor data collection; ethical, legal, and social implications; perceived challenges of using such mobile phone capabilities; and proposed solutions to improve the utility of phone-based capabilities in data collection on NCD risk factors. Participants recognized the potential of mobile technologies to improve data collection efficiency and expand access to hard-to-reach populations. However, concerns emerged regarding inadequate informed consent, risks to privacy and confidentiality, unclear data ownership, and vulnerabilities created by inconsistent enforcement of data protection laws. Social concerns included low digital literacy, unequal access to mobile devices, and fear of stigmatization. Participants emphasized the need for transparent communication, robust data governance, and community engagement. Conclusion: Mobile phone-based systems can strengthen the collection of NCD risk factor data in low-resource settings; however, their benefits depend on addressing key ethical, legal, and social challenges. To ensure responsible deployment, digital health initiatives must prioritize participant autonomy, data protection, equity, and trust building. Integrating contextualized ethical, legal, and social considerations into design and policy frameworks will be essential to leveraging mobile technologies in ways that support inclusive and effective NCD prevention and control.

4
Tuberculosis in households with infectious cases in Kampala city: Harnessing health data science for new insights on an ancient disease with persistent, unresolved problems (DS-IAFRICA TB) study protocol

Nassinghe, E.; Musinguzi, D.; Takuwa, M.; Kamulegeya, R.; Nabatanzi, R.; Namiiro, S.; Mwikirize, C.; Katumba, A.; Kivunike, F. N.; Ssengooba, W.; Nakatumba-Nabende, J.; Kateete, D. P.

2026-04-25 infectious diseases 10.64898/2026.04.23.26351571 medRxiv
Top 0.1%
17.1%

Tuberculosis (TB) is prevalent in Uganda and overlaps with a high rate of HIV/TB coinfection. While nearly all hospital-based TB cases in Kampala, the capital of Uganda, show clear TB symptoms, 30% or more of undiagnosed TB cases found through active screening are asymptomatic. Additionally, the host risk factors for TB in Kampala cannot be distinguished from environmental risk factors. These TB-specific challenges are just part of the complexity, especially in areas with high HIV/AIDS burden. Data science techniques, especially Artificial Intelligence (AI) and Machine Learning (ML) algorithms, could help untangle this complexity by identifying factors related to the host, pathogen, and environment, which are difficult to explain or predict with traditional/conventional methods. In this project, we will use health data science approaches (AI/ML) to identify factors driving TB transmission within households and reasons for anti-TB treatment failure. We will utilize the computational resources at Makerere University and available demographic, clinical, and laboratory data from TB patients and their contacts to develop AI and ML algorithms. These will aim to: (1) identify patients at baseline (month 0) unlikely to convert their sputum or culture results by months 2 and 5, thus at risk of failing TB treatment; (2) identify household contacts of TB cases who are at risk of developing TB disease, as well as contacts who may resist TB infection despite repeated exposure to M. tuberculosis. Achieving these objectives will provide evidence that data science methods are effective for early detection of potential TB cases and high-risk patients, thereby helping to reduce TB transmission in the community. The study protocol received approval from the School of Biomedical Sciences IRB, protocol number SBS-2023-495.

5
Leveraging Predictive AI and LLM-Powered Trial Matching to Improve Clinical Trial Recruitment: A Usability Assessment of Trialshub

Blankson, P.-K.; Hussien, S.; Idris, F.; Trevillion, G.; Aslam, A.; Afani, A.; Dunlap, P.; Chepkorir, J.; Melgarejo, P.; Idris, M.

2026-04-20 health informatics 10.64898/2026.04.17.26351107 medRxiv
Top 0.2%
10.2%

Background: Recruitment remains a major barrier to timely clinical trial completion. Trialshub is an LLM-powered, chat-based platform intended to help users identify relevant trials and connect with coordinators to streamline recruitment workflows. Objective: To evaluate the perceived usability and operational value of Trialshub, and identify implementation considerations for real-world deployment. Methods: A usability test of the Trialshub application was conducted at Morehouse School of Medicine. Purposively selected participants included clinical research coordinators and individuals with and without clinical trial search experience. Participants completed a pre-test survey assessing demographics, digital health information behaviors, and familiarity with AI tools, followed by a moderated usability session using a Trialshub prototype. Users completed scenario-based tasks (locating a breast cancer trial, reviewing results, and initiating coordinator contact) using a think-aloud protocol. Task ratings, screen recordings, and transcribed feedback were analyzed descriptively and thematically. Results: Participants reported high comfort with digital tools and moderate-to-high familiarity with AI. Trialshub's chat-first design, guided prompts, and checklist-style eligibility display were perceived as intuitive and reduced cognitive load. Fast access to trials and the coordinator-contact workflow were viewed positively. Key usability issues included uncertainty at step transitions, insufficient cues for selecting results and next actions, and inconsistent system reliability (loading delays, errors, and broken trial detail pages). Participants also noted redundant questioning due to limited conversational memory, requested improved filtering/sorting, and asked for clearer calls-to-action. All participants indicated that Trialshub has strong potential to meaningfully improve clinical trial processes. Conclusions: Trialshub shows promise for improving trial discovery and recruitment workflows, with identified design implications for real-world deployment.

6
Patient preferences for portable versus table-mounted visual field devices in rural Alabama: a mixed methods study within a telemedicine setting

Antwi-Adjei, E. K.; Datta, S.; Girkin, C. A.; Owsley, C.; Rhodes, L. A.; Fifolt, M.; Racette, L.

2026-04-25 ophthalmology 10.64898/2026.04.23.26351565 medRxiv
Top 0.2%
8.5%

Purpose To evaluate patient satisfaction and preferences for portable versus table-mounted visual field (VF) devices in a rural telemedicine setting and identify influencing factors. Methods We conducted a sequential explanatory mixed methods study at three Federally Qualified Health Centers (FQHCs) within the Alabama Screening and Intervention for Glaucoma and eye Health through Telemedicine (AL-SIGHT) study. Participants completed VF testing with table-mounted Humphrey Field Analyzer (HFA), tablet-based Melbourne Rapid Fields (MRF), and virtual reality (VR)-based VisuALL perimeters. Participants rated satisfaction, comfort, ease of use, and future testing preference. Chi-square tests assessed differences in device preferences. Twelve participants completed semi-structured interviews to explore reasons underlying preferences. Qualitative data were analyzed in NVivo 14 using reflexive thematic analysis. Results Among 271 respondents (mean age 60.4 years; 62.4% women), 50.6% preferred VR-based, 35.1% tablet-based, and 14.4% table-mounted for future testing (χ²(2) = 53.52, p<0.001, Cramér's V = 0.31). Satisfaction was highest for VR-based (56.9% very satisfied), followed by tablet-based (49.4%), and HFA (38.0%). VR-based perimeter was most frequently selected as the most comfortable (55.7%; χ²(2) = 63.33, p<0.001, V = 0.34) and easiest to use (54.6%; χ²(2) = 71.96, p<0.001, V = 0.36). Preferences did not vary significantly across demographic variables (all p>0.05). Qualitative themes identified four key drivers: comfort and physical experience, visual experience, ease of use and interaction, and psychological and motivational factors. Portability and community suitability were valued. Conclusion Rural underserved patients strongly preferred portable visual field devices, particularly VR-based, over table-mounted HFA. Comfort, ergonomic flexibility, immersive visual experience, and simplicity of interaction were central determinants of preference. Portable perimetry may enhance patient-centered glaucoma monitoring within telemedicine programs and access in resource-limited settings.
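The headline preference test is a chi-square goodness-of-fit over the three device choices. A short sketch that reconstructs the counts from the reported percentages of 271 respondents and recovers the paper's χ²(2) = 53.52 and Cramér's V = 0.31:

```python
# Chi-square goodness-of-fit plus Cramér's V for the device-preference result.
# Counts are back-calculated from 50.6% / 35.1% / 14.4% of n = 271.
import numpy as np
from scipy.stats import chisquare

counts = np.array([137, 95, 39])   # VR-based, tablet-based, table-mounted
n = counts.sum()
chi2, p = chisquare(counts)        # equal expected frequencies under the null
v = np.sqrt(chi2 / (n * (len(counts) - 1)))
print(f"chi2(2) = {chi2:.2f}, p = {p:.2g}, Cramer's V = {v:.2f}")
```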

7
Uncertainty-Gated Glaucoma Screening: Combining Semi-Supervised Classification with Multi-Agent Large Language Model Deliberation

Garimella Narasimha, S. V.; Brown, N.; Sridhar, S.

2026-04-20 ophthalmology 10.64898/2026.04.17.26351127 medRxiv
Top 0.3%
8.0%

Automated glaucoma screening from optical coherence tomography (OCT) faces two persistent challenges: scarcity of expert-labeled data and unreliable model predictions on diagnostically ambiguous cases. We present a two-tier diagnostic pipeline that addresses both. In the first tier, an EfficientNetV2-S classifier trained under a semi-supervised pseudo supervisor framework achieves 0.84 AUC on 150 held-out test patients from the Harvard Glaucoma Detection and Progression dataset, using only 350 labeled training samples out of 700. In the second tier, 124 flagged cases are routed to a multi-agent system built on MedGemma 4B, where three specialist agents deliberate over three rounds before rendering a final diagnosis. On these flagged cases, the agent system achieves 100% sensitivity (detecting all 55 glaucoma cases with zero missed diagnoses) and 89.5% overall accuracy (111/124), compared to the classifier's 73.4% (91/124). Uncertainty analysis confirms that the classifier's output probability reliably separates confident predictions (96.3% accuracy, n = 27) from uncertain ones (74.0%, n = 123), producing a 22-percentage-point gap that serves as a triage signal. The agents correct 32 cases the classifier misclassified while introducing 12 new errors, yielding a net improvement of 20 cases. These results are from a single training run without variance estimates and should be interpreted as preliminary evidence that uncertainty-gated routing to vision-language model agents can meaningfully improve diagnostic accuracy on the cases where automated classifiers are least reliable.
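The gating logic itself is simple: accept the classifier's call when its probability is decisive, escalate to the second tier otherwise. A hedged sketch, where the confidence band and the second-tier stub are hypothetical placeholders rather than the paper's actual cut-offs or the MedGemma agents:

```python
# Uncertainty-gated routing: confident cases keep the classifier label,
# ambiguous cases go to a second-tier reviewer (stubbed out here).
import numpy as np

LOW, HIGH = 0.2, 0.8   # hypothetical confidence band, not the paper's values

def second_tier_review(case_id: int) -> int:
    """Stub standing in for the multi-agent LLM deliberation."""
    return 1  # placeholder decision

def triage(case_ids, probs):
    decisions = {}
    for cid, p in zip(case_ids, probs):
        if LOW <= p <= HIGH:                 # uncertain: escalate
            decisions[cid] = second_tier_review(cid)
        else:                                # confident: accept classifier call
            decisions[cid] = int(p > 0.5)
    return decisions

print(triage(range(4), np.array([0.05, 0.45, 0.93, 0.61])))
```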

8
Identifying clinician perceived priorities for a real-time wearable system for in-hospital monitoring: findings and evolutions following the COVID-19 pandemic

Vollam, S.; Roman, C.; King, E.; Tarassenko, L.

2026-04-24 health systems and quality improvement 10.64898/2026.04.21.26350610 medRxiv
Top 0.4%
6.4%

A Wearable Monitoring System (WMS), comprising a chest patch, wrist-worn pulse oximeter, and arm-worn blood pressure device, was developed in preparation for a pilot Randomised Controlled Trial (RCT) on a UK surgical ward. The system was designed to support continuous physiological monitoring and early detection of deterioration. An initial prototype user interface was developed by the research team based on prior clinical experience and engineering knowledge. To ensure suitability for clinical practice, iterative user-centred refinement was undertaken through a series of clinician focus groups and wearability assessments. Six focus groups were conducted between November 2019 and May 2021 involving multidisciplinary healthcare professionals. Feedback from these sessions informed successive interface and system modifications. System development spanned the COVID-19 pandemic, during which the WMS was rapidly adapted and deployed to support clinical care on isolation wards. Feedback obtained during this period was incorporated into later versions of the system and provided a unique opportunity to examine changes in clinician priorities under pandemic conditions. Clinicians consistently prioritised alert visibility, alarm fatigue mitigation, parameter flexibility, and centralised monitoring. Notably, preferences regarding alert modality and access mechanisms evolved over time: early enthusiasm for mobile or smartphone-type devices shifted towards a preference for fixed, ward-based displays and audible alerts at the nurses' station following pandemic deployment. Building on previous wearability testing in healthy volunteers, wearability testing using a validated questionnaire was completed by 169 patient participants during the RCT. The chest patch and pulse oximeter demonstrated high tolerability, whereas the blood pressure cuff showed poor wearability and was removed from the final system. These findings demonstrate the importance of iterative, clinician-led design of wearable monitoring systems and highlight how extreme clinical contexts such as the COVID-19 pandemic can significantly reshape perceived requirements for safety-critical monitoring technologies.

9
Vital signs, demographics, and clinical events for low-birth-weight infants from four intensive care units

German Mesner, I.; Lake, D. E.; Kausch, S. L.; Krahn, K. N.; Gummadi, A.; Clark, T. W.; Niestroy, J. C.; Sahni, R.; Vesoulis, Z. A.; Gootenberg, D. B.; Ambalavanan, N.; Travers, C. P.; Fairchild, K. D.; Sullivan, B. A.

2026-04-20 pediatrics 10.64898/2026.04.15.26350178 medRxiv
Top 0.4%
6.3%

Premature very low birth weight (VLBW) infants have high rates of mortality and morbidity from sepsis, necrotizing enterocolitis, and respiratory failure requiring intubation and mechanical ventilation. Earlier detection of cardiorespiratory deterioration using vital signs from continuous physiological monitoring may lead to more timely interventions and improved outcomes. To further this research area, we present PreMo, a publicly available dataset of continuous heart rate and oxygen saturation, demographics, clinical events, and outcomes for 3,829 VLBW patients from four Neonatal Intensive Care Units (NICUs) in the United States. The PreMo dataset consists of a collection of parquet files, RO-Crate metadata, and sample usage code scripts hosted on the University of Virginia LibraData Dataverse website.
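Since the dataset ships as parquet files, a first pass with pandas is the natural entry point. A hedged usage sketch in which the table layout and column names are assumptions about a PreMo-style schema, not its documented structure (a synthetic frame stands in for `pd.read_parquet` on the released files):

```python
# Toy stand-in for loading a PreMo-style vitals parquet and summarizing
# the continuous signals per patient. Columns here are hypothetical.
import pandas as pd

vitals = pd.DataFrame({
    "patient_id": ["A"] * 4 + ["B"] * 4,
    "timestamp": pd.date_range("2020-01-01", periods=8, freq="2s"),
    "heart_rate": [158, 160, 155, 150, 142, 145, 148, 151],
    "spo2": [94, 93, 95, 96, 91, 92, 90, 93],
})  # in practice: vitals = pd.read_parquet("<released file>.parquet")

print(vitals.groupby("patient_id")[["heart_rate", "spo2"]].mean())
```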

10
Vision Language Model for Coronary Angiogram Analysis and Report Generation: Development and Evaluation Study

Jiang, Q.; Ke, Y.; Sinisterra, L. G.; Elangovan, K.; Li, Z.; Yeo, K. K.; Jonathan, Y.; Ting, D. S. W.

2026-04-21 cardiovascular medicine 10.64898/2026.04.19.26351241 medRxiv
Top 0.4%
6.3%

Coronary artery disease is a leading cause of morbidity and mortality. Invasive coronary angiography is currently the gold standard in disease diagnosis. Several studies have attempted to use artificial intelligence (AI) to automate its interpretation with varying levels of success. However, most existing studies cannot generate detailed angiographic reports beyond simple classification or segmentation. This study aims to fine-tune and evaluate the performance of a Vision-Language Model (VLM) in coronary angiogram interpretation and report generation. Using twenty thousand angiogram keyframes from 1,987 patients collated across four unique datasets, we fine-tuned the InternVL2-4B model with Low-Rank Adaptor weights to perform stenosis detection, anatomy labelling, and report generation. The fine-tuned VLM achieved a precision of 0.56, recall of 0.64, and F1-score of 0.60 for stenosis detection. In anatomy segmentation, it attained a weighted precision of 0.50, recall of 0.43, and F1-score of 0.46, with higher scores in major vessel segments. Report generation integrating multiple angiographic projection views yielded an accuracy of 0.42, negative predictive value of 0.58 and specificity of 0.52. This study demonstrates the potential of using a VLM to streamline angiogram interpretation to rapidly provide actionable information to guide management, support care in resource-limited settings, and audit the appropriateness of coronary interventions. Author summary: Coronary artery disease carries a heavy disease burden worldwide, and coronary angiography is the gold-standard imaging for its diagnosis. Interpreting these complex images and producing clinical reports require significant expertise and time. In this study, we fine-tuned and investigated an open-source VLM, InternVL2-4B, to interpret and report coronary angiogram images in key tasks including stenosis detection, anatomy identification, as well as full report generation. We also referenced the fine-tuned InternVL2-4B against a state-of-the-art segmentation model, YOLOv8x, which was evaluated on the same test sets. We examined how machine learning metrics like the intersection over union score may not fully capture the clinical accuracy of model predictions and discussed the limitations of relying solely on these metrics for evaluating clinical AI systems. Although the model has not yet achieved expert-level interpretation, our results demonstrate the potential and feasibility of automating the reporting of coronary angiograms. Such systems could potentially assist cardiologists by improving reporting efficiency, highlighting lesions that may require review, and enabling automated calculation of clinical scores such as the SYNTAX score.

11
Design and preliminary safety validation of a hybrid deterministic-AI triage system for multilingual primary healthcare: a WhatsApp-based vignette study in South Africa

Nkosi-Mjadu, B. E.

2026-04-22 health informatics 10.64898/2026.04.21.26349781 medRxiv
Top 0.4%
6.2%

Background: South Africa's public healthcare system serves most of the population through approximately 3,900 primary healthcare clinics characterised by long waiting times and high volumes of repeat-prescription visits. No published pre-arrival digital triage system operates across all 11 official South African languages while aligning with the South African Triage Scale (SATS). This paper reports the design and preliminary safety validation of BIZUSIZO, a hybrid deterministic-AI WhatsApp triage system. Methods: BIZUSIZO delivers SATS-aligned triage via WhatsApp, combining AI-assisted free-text classification (Claude Haiku 4.5) with a Deterministic Clinical Safety Layer (DCSL) that overrides AI output for 53 clinical discriminator categories (14 RED, 19 ORANGE, 20 YELLOW) coded in all 11 official languages and independent of AI availability. A five-domain risk factor assessment can only upgrade the triage level. One hundred and twenty clinical vignettes in patient language (English, isiZulu, isiXhosa, Afrikaans; 30 per language) were scored against a developer-assigned gold standard with independent blinded nurse review. A 121-vignette multilingual DCSL safety consistency check across all 11 languages and a 220-call post-hoc framing sensitivity evaluation (110 paired vignettes) were also conducted. Results: Under-triage was 3.3% (4/120; 95% CI: 0.9%-8.3%) with no RED under-triage; exact concordance was 80.0% (96/120) and quadratic weighted kappa 0.891 (95% CI: 0.827-0.932). One two-level under-triage was observed on a non-RED presentation (V072, isiXhosa burns vignette, ORANGE→GREEN); one two-level over-triage was observed (V054, isiZulu deep laceration, YELLOW→RED). In the framing sensitivity evaluation, AI-only classification achieved 50.9% RED invariance under adversarial framing; full-pipeline classification achieved 95.0% in four validated languages, with the DCSL rescuing 18 of 23 AI drift cases. Conclusions: A hybrid deterministic-AI triage system with DCSL-based emergency detection achieved zero RED under-triage and consistent RED detection across all 11 official languages. The 16.7% over-triage rate falls within published South African SATS ranges (13.1-49%). A single two-level under-triage event was observed on an isiXhosa burns vignette (ORANGE→GREEN) and is discussed in Limitations. Findings are preliminary; prospective validation against independent nurse triage is the necessary next step.
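The safety property hinges on two rules: a deterministic discriminator match can only raise the AI's level, and risk factors can only upgrade. An illustrative sketch of that override logic under stated assumptions; the keyword table below is invented and heavily abbreviated, not the BIZUSIZO discriminator set:

```python
# Hybrid deterministic-AI triage step: the DCSL sets a floor on the AI level,
# and risk factors may only upgrade. Discriminator phrases are hypothetical.
SATS_ORDER = ["GREEN", "YELLOW", "ORANGE", "RED"]
DISCRIMINATORS = {          # invented examples for illustration only
    "not breathing": "RED",
    "chest pain": "ORANGE",
    "deep burn": "ORANGE",
}

def triage(text: str, ai_level: str, risk_upgrade: bool = False) -> str:
    level = ai_level
    for phrase, floor in DISCRIMINATORS.items():          # deterministic override
        if phrase in text.lower():
            if SATS_ORDER.index(floor) > SATS_ORDER.index(level):
                level = floor
    if risk_upgrade and level != "RED":                   # upgrade-only rule
        level = SATS_ORDER[SATS_ORDER.index(level) + 1]
    return level

print(triage("Patient has a deep burn on the arm", ai_level="GREEN"))  # ORANGE
```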

12
AI/ML-based prediction of TB treatment failure: A systematic review and meta-analysis

Kamulegeya, R.; Nabatanzi, R.; Semugenze, D.; Mugala, F.; Takuwa, M.; Nasinghe, E.; Musinguzi, D.; Namiiro, S.; Katumba, A.; Ssengooba, W.; Nakatumba-Nabende, J.; Kivunike, F. N.; Kateete, D. P.

2026-04-22 infectious diseases 10.64898/2026.04.16.26350453 medRxiv
Top 0.4%
6.2%

Background: Tuberculosis (TB) remains a leading cause of infectious disease mortality worldwide, and treatment failure contributes to ongoing transmission, drug resistance, and poor clinical outcomes. Artificial intelligence and machine learning approaches have attracted growing interest for predicting tuberculosis treatment outcomes, but the literature is heterogeneous and lacks a comprehensive synthesis. Methods: We conducted a systematic review and meta-analysis of studies that developed or validated machine learning models to predict TB treatment failure. We searched PubMed/MEDLINE and Embase from January 2000 to October 2025. Studies were eligible if they developed, validated, or implemented an artificial intelligence or machine learning model for the prediction of TB treatment failure or a closely related poor outcome in patients receiving anti-TB treatment. Risk of bias was assessed using the Prediction model Risk Of Bias Assessment Tool. Random-effects meta-analysis was performed to pool area under the curve values, with subgroup analyses and meta-regression to explore heterogeneity. Results: Thirty-four studies were included in the systematic review, of which 19 reported area under the curve values suitable for meta-analysis (total participants, 100,790). Studies were published between 2014 and 2025, with 91% published from 2019 onward. Tree-based methods were the most common algorithm family (52.9%), and multimodal models integrating three or more data types were used in 41.2% of studies. The pooled area under the curve was 0.836 (95% confidence interval 0.799-0.868), with substantial heterogeneity (I² = 97.9%). In subgroup analyses, studies including HIV-positive participants showed lower discrimination (pooled area under the curve 0.748) compared to those excluding them (0.924). Only eight studies (23.5%) performed external validation, and only one study (2.9%) was rated as low risk of bias overall, primarily due to methodological concerns in the analysis domain. Egger's test suggested publication bias (p = 0.024). Major evidence gaps included underrepresentation of high-burden countries, HIV-affected populations, social determinants, pediatric TB, and extrapulmonary disease. Conclusions: Machine learning models for predicting TB treatment failure show promising discrimination but are not yet ready for routine clinical implementation. Performance varies substantially across populations and settings, and methodological limitations, including inadequate validation, poor calibration assessment, and high risk of bias, limit confidence in current estimates. Future research should prioritize rigorous external validation, calibration assessment, and development in underrepresented populations, particularly HIV-affected and high-burden settings. Author summary: TB kills over a million people annually. While curable, treatment failure remains common and drives ongoing transmission and drug resistance. Researchers increasingly use artificial intelligence and machine learning to predict which patients will fail treatment, but it is unclear if these models are ready for clinical use. We reviewed 34 studies including nearly 1.1 million participants from 22 countries. On average, models correctly distinguished patients who would fail treatment from those who would not 84% of the time, a performance generally considered good. However, this average hid enormous variation.
Models developed in populations including HIV-positive people performed substantially worse, suggesting prediction is harder with HIV co-infection. Worryingly, only one study used high-quality methods; 97% had serious flaws in handling missing data, checking calibration, or testing in new populations. Only eight studies validated their models in different settings. To conclude, we found that machine learning is promising in predicting TB treatment failure, but it is not ready for clinical use. Researchers should prioritize validation in high-burden settings, include social determinants, and improve methodological rigor before these tools can help patients.
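The pooling step reported above is a random-effects meta-analysis; DerSimonian-Laird is the standard estimator for this kind of study-level AUC synthesis. A minimal sketch under stated assumptions: the effect sizes and standard errors below are invented, and the review's actual 19-study inputs are not reproduced:

```python
# DerSimonian-Laird random-effects pooling with I^2 heterogeneity.
import numpy as np

def dersimonian_laird(effects, ses):
    effects, ses = np.asarray(effects), np.asarray(ses)
    w = 1 / ses**2                                  # fixed-effect weights
    fixed = np.sum(w * effects) / np.sum(w)
    q = np.sum(w * (effects - fixed) ** 2)          # Cochran's Q
    df = len(effects) - 1
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)                   # between-study variance
    w_star = 1 / (ses**2 + tau2)                    # random-effects weights
    pooled = np.sum(w_star * effects) / np.sum(w_star)
    se = np.sqrt(1 / np.sum(w_star))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, se, i2

aucs = [0.78, 0.91, 0.84, 0.70, 0.95]   # invented study-level AUCs
ses = [0.03, 0.02, 0.04, 0.05, 0.02]    # invented standard errors
pooled, se, i2 = dersimonian_laird(aucs, ses)
print(f"pooled AUC {pooled:.3f} "
      f"(95% CI {pooled - 1.96*se:.3f}-{pooled + 1.96*se:.3f}), I^2 = {i2:.1f}%")
```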

13
Dissecting clinical reasoning failures in frontier artificial intelligence using 10,000 synthetic cases

Auger, S. D.; Varley, J.; Hargovan, M.; Scott, G.

2026-04-23 neurology 10.64898/2026.04.22.26351488 medRxiv
Top 0.4%
5.0%

Background: Current medical large language model (LLM) evaluations largely rely on small collections of cases, whereas rigorous safety testing requires large-scale, diverse, and complex cases with verifiable ground truth. Multiple Sclerosis (MS) provides an ideal evaluation model, with validated diagnostic criteria and numerous paraclinical tests informing differential diagnosis, investigation, and management. Methods: We generated synthetic MS cases with ground-truth labels for diagnosis, localisation, and management. Four frontier LLMs (Gemini 3 Pro/Flash, GPT 5.2/5 mini) were instructed to analyse cases to provide anatomical localisation, differential diagnoses, investigations, and management plans. An automated evaluator compared these outputs to the ground-truth labels. Blinded subspecialty experts validated 70 cases for realism and automated evaluator accuracy. We then evaluated LLM decision-making across 1,000 cases and scaled to 10,000 to characterise rare, catastrophic failures. Results: Subspecialist expert review confirmed 100% synthetic case realism and 99.8% (95% CI 95.5 to 100) automated evaluation accuracy. Across 1,000 generated MS cases, all LLMs successfully included MS in the differential diagnoses for more than 91% of cases. However, diagnostic competence did not associate with treatment safety. Gemini 3 models had low rates of clinically appropriate steroid recommendations (Flash: 7.2%, 95% CI 5.6 to 8.8; Pro: 15.8%, 95% CI 13.6 to 18.1) compared to GPT 5 mini (23.5%, 95% CI 20.8 to 26.1), frequently overlooking contraindications like active infection. OpenAI models inappropriately recommended acute intravenous thrombolysis for MS cases (GPT 5.2: 9.6%; GPT 5 mini: 6.4%) compared to below 1% for Gemini models. Expanded evaluation (to 10,000 cases) probed these errors in detail. Thrombolysis was recommended in 10.1% of cases lacking symptom timing information and paradoxically persisted (2.9%) even when symptoms were explicitly documented as more than 14 days old. Conclusion: Automated expert-level evaluation across 10,000 cases characterised artificial intelligence clinical blind spots hitherto invisible to small-scale testing. Massive-scale simulation and automated interrogation should become standard for uncovering serious failures and implementing safety guardrails before clinical deployment exposes patients to risk.
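At the core of this pipeline is an automated evaluator that scores each model output against case-level ground truth. A minimal sketch of that shape, assuming hypothetical field names; the actual evaluator is far richer (localisation, investigations, management) and is not reproduced here:

```python
# Toy automated evaluator: check the ground-truth diagnosis appears in the
# model's differential and flag one unsafe action. Fields are hypothetical.
def evaluate(case):
    differential = [d.lower() for d in case["model_differential"]]
    return {
        "dx_in_differential": case["truth_dx"].lower() in differential,
        "unsafe_thrombolysis": "thrombolysis" in case["model_plan"].lower(),
    }

case = {
    "truth_dx": "Multiple Sclerosis",
    "model_differential": ["Multiple Sclerosis", "Neuromyelitis optica"],
    "model_plan": "MRI brain and spine with contrast; lumbar puncture",
}
print(evaluate(case))  # {'dx_in_differential': True, 'unsafe_thrombolysis': False}
```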

14
Multimodal prediction of visual improvement in diabetic macular edema using real-world electronic health records and optical coherence tomography images

Sun, S.; Cai, C. X.; Fan, R.; You, S.; Tran, D.; Rao, P. K.; Suchard, M. A.; Wang, Y.; Lee, C. S.; Lee, A. Y.; Zhang, L.

2026-04-24 health informatics 10.64898/2026.04.23.26351616 medRxiv
Top 0.5%
4.6%

Multimodal learning has the potential to improve clinical prediction by integrating complementary data sources, but the incremental value of imaging beyond structured electronic health record (EHR) data remains unclear in real-world settings. We developed a multimodal survival modeling framework integrating optical coherence tomography (OCT) and EHR data to predict time to visual improvement in patients with diabetic macular edema (DME), and evaluated how different ophthalmic foundation model representations contribute to prognostic performance. In a retrospective cohort of 973 patients (1,450 eyes) receiving anti-vascular endothelial growth factor therapy, we compared multimodal models combining 22,227 EHR variables with 196,402 OCT images, with OCT embeddings derived from three ophthalmic foundation models (RETFound, EyeCLIP, and VisionFM). The EHR-only model showed minimal prognostic discrimination (C-index 0.50 [95% CI, 0.45-0.55]). Incorporating OCT improved performance, with the magnitude of improvement depending on the representation. EHR+RETFound achieved the strongest performance (C-index 0.59 [0.54-0.65]), followed by EHR+EyeCLIP (0.57 [0.52-0.62]) and EHR+VisionFM (0.56 [0.51-0.61]). Multimodal models, particularly EHR+RETFound, demonstrated improved risk stratification with clearer separation of Kaplan-Meier curves. Partial information decomposition revealed that prognostic information was dominated by modality-specific contributions, with OCT and EHR providing largely distinct signals and minimal shared information. The magnitude of OCT-specific contribution varied across foundation models and aligned with observed performance differences. These findings indicate that OCT provides complementary prognostic value beyond structured clinical data, but gains are modest and depend strongly on representation choice. Our results highlight both the promise of multimodal modeling for personalized prognosis and the need for rigorous, context-specific evaluation of foundation models in real-world clinical settings.
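The headline numbers above are concordance indices for time-to-improvement. A quick sketch of that discrimination metric on toy survival data; `lifelines.utils.concordance_index` is a real utility, while the times, censoring flags, and predictions below are fabricated:

```python
# Harrell's C-index on synthetic time-to-visual-improvement data.
import numpy as np
from lifelines.utils import concordance_index

rng = np.random.default_rng(2)
time_to_improvement = rng.exponential(180, 200)        # days, synthetic
observed = rng.binomial(1, 0.7, 200)                   # 1 = improvement observed
predicted_time = time_to_improvement + rng.normal(0, 80, 200)  # noisy model output

c = concordance_index(time_to_improvement, predicted_time, observed)
print("C-index:", round(c, 3))   # 0.5 = chance, as for the EHR-only model above
```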

15
When Data Meets Practice: A Qualitative Study of Clinician Perspectives on Streaming Data in Mental Health

Tian, J.; Kurkova, V.; Wu, Y.; Adu, M.; Hayward, J.; Greenshaw, A. J.; Cao, B.

2026-04-25 psychiatry and clinical psychology 10.64898/2026.04.23.26351640 medRxiv
Top 0.6%
4.0%

Patient-generated streaming data from wearable and digital technologies is increasingly promoted as a means of supporting mental health monitoring and clinical decision-making. While patient acceptance of these technologies has been reported, clinician perspectives remain underexplored despite their central role in determining whether streaming data are meaningfully integrated into routine care. This study explored clinicians' experiences, as well as perceived facilitators and barriers, related to integrating patient-generated streaming data into routine mental health practice. A qualitative, exploratory interview study was conducted to examine clinicians' experiences and perspectives on integrating patient-generated streaming data into mental health care. Semi-structured interviews were conducted with 33 clinicians, including family physicians (n=11), psychiatrists (n=12), and psychologists (n=10). Data were analyzed using reflexive thematic analysis guided by Braun and Clarke's six-step approach. Six themes were identified. Clinicians described variable use of digital and streaming technologies, ranging from routine engagement to deliberate non-use. Streaming data were viewed as clinically valuable when they provided longitudinal and objective insights, identified physiological and behavioural pattern changes, and supported patient engagement. However, clinicians emphasized that clinical usefulness was contingent on interpretability, contextual information, and relevance to decision-making. Major barriers included poor integration with electronic medical records, time constraints, data volume, limited organizational support, and uncertainty regarding data reliability and validity. Clinicians also expressed persistent concerns about privacy, governance, and regulatory oversight, highlighting the need for clear safeguards and accountability structures. Clinicians view patient-generated streaming data as a promising adjunct to mental health care, particularly for capturing longitudinal change between visits. However, meaningful clinical integration remains constrained by usability, workflow, organizational, and regulatory challenges, as well as limited confidence in data interpretation. Addressing these barriers through improved system integration, interpretive support, validation, and governance will be essential for translating the potential of streaming data into routine clinical practice.

16
Assessing medication-related burden and medication adherence among older patients from Central Nepal: A machine learning approach

Giri, R.; Agrawal, R.; Lamichhane, S. R.; Barma, S.; Mahatara, R.

2026-04-23 geriatric medicine 10.64898/2026.04.22.26351447 medRxiv
Top 0.6%
3.8%

We are pleased to submit our original article entitled "Assessing medication-related burden and medication adherence among older patients from Central Nepal: A machine learning approach" for consideration in your esteemed journal. In this paper, we assessed medication-related burden using the validated Living with Medicines Questionnaire (LMQ-3) and medication adherence using the Adherence to Refills and Medications Scale (ARMS). We analysed the results using a machine learning approach, rather than traditional statistics alone, to identify the complex factors influencing both. Six ML architectures (Ordinary Least Squares, LightGBM, Random Forest, XGBoost, SVM, and penalized linear regression) were employed to predict ARMS and LMQ scores from various socio-demographic, clinical, and medication-related predictive features. Model explainability was provided through SHAP (Shapley Additive exPlanations). Our study identified moderate medication burden and moderate non-adherence among older adults. Requiring assistance with medication and polypharmacy were the strongest drivers of medication burden and non-adherence. The high predictive accuracy of the ML models supports targeted clinical interventions, such as deprescribing, to address the highly prevalent medication burden and non-adherence among older adults in Nepal.
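The explainability step described above pairs a tree ensemble with SHAP attributions. A hedged sketch on invented features (the study's predictors are not reproduced); `shap.TreeExplainer` is the real API, and a random forest stands in for whichever of the six architectures performed best:

```python
# Fit a tree model on tabular predictors and rank features by mean |SHAP|.
# Feature names and data are fabricated stand-ins for the study's variables.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = pd.DataFrame({
    "n_medications": rng.integers(1, 12, 300),      # polypharmacy proxy
    "needs_assistance": rng.binomial(1, 0.3, 300),
    "age": rng.integers(60, 90, 300),
})
y = 20 + 3 * X["n_medications"] + 8 * X["needs_assistance"] + rng.normal(0, 5, 300)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
shap_values = shap.TreeExplainer(model).shap_values(X)
importance = np.abs(shap_values).mean(axis=0)       # global feature ranking
print(dict(zip(X.columns, importance.round(2))))
```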

17
Individualized Forecasting of Headache Attack Risk Using a Continuously Updating Model

Houle, T. T.; Lebowitz, A.; Chtay, I.; Patel, T.; McGeary, D. D.; Turner, D. P.

2026-04-22 neurology 10.64898/2026.04.20.26350119 medRxiv
Top 0.6%
3.7%

Importance: Migraine attacks often occur unpredictably, limiting the ability of individuals to initiate timely preventive or preemptive treatment. Short-term probabilistic forecasting of migraine risk could enable more targeted management strategies. Objective: To externally validate the previously developed Headache Prediction Model (HAPRED-I), evaluate an updated continuously learning model (HAPRED-II), and assess the feasibility and short-term safety of delivering individualized probabilistic migraine forecasts directly to patients. Design, Setting, and Participants: Prospective 8-week cohort study conducted remotely at two academic medical centers in the United States (Massachusetts General Hospital and Wake Forest Health Sciences) between 2015 and 2019. Adults with recurrent migraine or tension-type headache completed twice-daily electronic diaries. A total of 230 participants contributed 23,335 diary entries across 11,862 participant-days of observation. Main Outcomes and Measures: Occurrence of a headache attack within 24 hours following each evening diary entry. Model performance was evaluated using discrimination (area under the receiver operating characteristic curve [AUC]) and calibration. Results: External validation of HAPRED-I demonstrated modest discrimination (AUC, 0.59; 95% CI, 0.57-0.61) and poor calibration, with predicted probabilities consistently exceeding observed headache risk. In contrast, the continuously updating HAPRED-II model demonstrated progressive improvement in predictive performance as participant-specific data accumulated. Discrimination increased from an AUC of 0.59 (95% CI, 0.57-0.61) during the first 14 days to 0.66 (95% CI, 0.63-0.70) after the first month, accompanied by improved calibration across predicted risk levels. Over the study period, 6,999 individualized forecasts were delivered directly to participants. No evidence suggested that receipt of forecasts was associated with increasing headache frequency or worsening predicted headache risk trajectories. Conclusions and Relevance: A static migraine forecasting model demonstrated limited transportability to new individuals. In contrast, models that continuously update within individuals may improve predictive accuracy over time and enable real-time delivery of personalized migraine risk forecasts. Further work incorporating richer physiologic and contextual predictors will likely be necessary before such systems can reliably guide clinical treatment decisions.
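The "continuously updating" idea maps naturally onto incremental fitting: start from a population model, then refine on each participant's accumulating diary entries. A minimal sketch assuming simulated features; `SGDClassifier.partial_fit` is the real sklearn call, and nothing here reproduces HAPRED-II's actual specification:

```python
# Population pre-training followed by per-participant incremental updates,
# forecasting next-day headache risk before each update. Data are simulated.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(4)
model = SGDClassifier(loss="log_loss", random_state=0)
model.partial_fit(rng.normal(size=(500, 4)),         # population pre-training
                  rng.binomial(1, 0.3, 500), classes=[0, 1])

for day in range(56):                                 # 8-week study window
    x_today = rng.normal(size=(1, 4))                 # evening diary features
    headache_next_day = rng.binomial(1, 0.3, 1)       # observed outcome
    risk = model.predict_proba(x_today)[0, 1]         # forecast before updating
    model.partial_fit(x_today, headache_next_day)     # within-person update

print(f"final-day forecast risk: {risk:.2f}")
```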

18
Decision Curve Analysis for Evaluating Machine Learning Models for Next-Day Transfer Out of ICU

Pozo, M.; Pape, A.; Locke, B.; Pettine, W. W.

2026-04-21 health informatics 10.64898/2026.04.19.26351213 medRxiv
Top 0.6%
3.7%

Timely identification of intensive care unit (ICU) patients likely to exit the unit can support anticipatory workflows such as chart review, eligibility screening, and patient outreach prior to transfer. Most ICU discharge prediction studies report discrimination and calibration, but these metrics do not quantify the decision consequences of acting on predictions. Using adult ICU admissions from MIMIC-IV, we represented each ICU stay as a sequence of daily clinical summaries and trained logistic regression, random forest, and XGBoost models to predict next-day ICU transfer. Models achieved ROC AUC of 0.80-0.84 with differing calibration. We evaluated decision utility using decision curve analysis (DCA), where positive predictions trigger proactive review. Across thresholds, model-guided strategies outperformed review-all, review-none, and a simple clinical rule. To translate net benefit into implementable operations, we modeled a clinical trial recruitment workflow with an 8-hour daily time constraint, incorporating chart review and consent effort. At a feasible operating threshold (0.23), the model flagged ~23 charts/day and yielded ~1.23 enrollments/day under conservative eligibility and consent assumptions. These results demonstrate that DCA provides a transparent framework for determining when ICU transfer predictions are worth using and how thresholds should be selected to align with real-world workflow constraints. Data and Code Availability: This research has been conducted using data from MIMIC-IV. Researchers can request access via PhysioNet. Implementation code is available upon request.
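The core DCA quantity is net benefit at a threshold probability pt, NB = TP/n − (FP/n)·pt/(1 − pt), which weighs true positives against false positives by the harm ratio implied by the threshold. A minimal sketch on synthetic data; only the 0.23 threshold comes from the paper:

```python
# Net benefit at a decision threshold, the quantity plotted in a decision curve.
import numpy as np

def net_benefit(y_true, y_prob, pt):
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    n = len(y_true)
    flagged = y_prob >= pt
    tp = np.sum(flagged & (y_true == 1))
    fp = np.sum(flagged & (y_true == 0))
    return tp / n - (fp / n) * pt / (1 - pt)

rng = np.random.default_rng(5)
y = rng.binomial(1, 0.25, 5000)                       # next-day transfer labels
p = np.clip(y * 0.3 + rng.beta(2, 5, 5000), 0, 1)     # toy model probabilities

for pt in (0.10, 0.23, 0.40):                         # 0.23 = paper's threshold
    print(f"pt={pt:.2f}: model NB={net_benefit(y, p, pt):.3f}, "
          f"review-all NB={net_benefit(y, np.ones_like(p), pt):.3f}")
```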

19
MedSAM2-CXR: A Box-Latent Framework for Chest X-ray Classification and Report Generation

Hakata, Y.; Oikawa, M.; Fujisawa, S.

2026-04-22 health informatics 10.64898/2026.04.20.26351338 medRxiv
Top 0.7%
3.6%

Who is affected: In Japan, approximately 100 million chest radiographs (CXRs) are acquired annually, while only about 7,000 board-certified diagnostic radiologists practice nationwide (Japan Radiological Society workforce statistics; OECD Health Statistics, most recent available year). This implies an average workload exceeding 10,000 imaging studies per radiologist per year if all CXRs were attributed to board-certified diagnostic radiologists (an upper-bound estimate, because in practice many CXRs are primarily read by non-radiologist physicians). In settings such as night shifts, weekends, remote islands, and regional care networks, non-radiologist physicians frequently act as primary readers. Despite strong demand for AI assistance, existing systems are typically limited by one of three shortcomings (poor cross-institutional generalization, limited interpretability, or inability to generate draft reports) and consequently see limited clinical deployment. What we built: We propose a Box-Latent Trinity that embeds each image as a hyperrectangle parameterized by a center c and a radius r, rather than as a single point in a latent space. We further introduce BL-TTA (Box-Latent Test-Time Augmentation), which approximately closes the train-inference gap (exact in the N → ∞ limit; N = 8 suffices in practice) by averaging predictions over samples drawn from within the latent box at inference time. Both components are implemented on top of the frozen MedSAM2 medical imaging foundation model. A single box representation simultaneously supports three functions: (A) theoretically grounded source selection, (B) device-invariant augmentation, and (C) case-based retrieval-augmented generation (RAG). Each prediction is accompanied by retrieved similar prior cases, a calibrated confidence estimate, and clinical-guideline references. How well it performs: On the Open-i CXR corpus (2,954 image-report pairs) under a patient-level 80/10/10 split and 5-seed reproducibility, the full system B5 achieves macro area under the receiver-operating-characteristic curve (macro-AUROC) 0.639 (best-seed test; 5-seed mean 0.626, Table 2; absolute +0.015 over the strongest same-backbone baseline, Merlin-style 0.624), elementwise accuracy 0.753 (absolute +0.072 over Merlin-style 0.681, equivalent to approximately 7 fewer label-level errors per 100 (label, image) predictions across 14 finding labels, not per 100 images), and report label-F1 0.435 (absolute +0.086, relative +25% over the strongest same-backbone report-generation baseline, Bootstrapping-style 0.349). Under simulated pixel-space device-shift intensities up to twice the training distribution, AUROC degrades by only 0.014. Brier score (macro) is 0.061; Cohen's κ between two independent rule-based label extractors is 0.702 (substantial agreement); the box radius yields an out-of-distribution (OOD) detection AUROC of 0.595; and the framework provides four structural explainable-AI (XAI) outputs (retrieved similar cases, confidence tier, per-axis uncertainty, and visual saliency), which we jointly quantify in a single CXR study, a combination that, to our knowledge, has not been reported previously.
Path to deployment: Because the complete experiment can be reproduced in under two hours on a consumer-grade GPU (NVIDIA RTX 4060, 8 GB VRAM), the framework can run on compute resources already available at typical healthcare institutions. The approach thus supports the practical delivery of evidence-grounded diagnostic support to night shifts, remote-island care, and secondary readings in health checkups, settings in which a board-certified radiologist is not locally available. One-sentence summary: Reproducible end-to-end in under two hours on a single consumer-grade GPU, the proposed framework outperforms the strongest same-backbone medical-AI baselines on three principal metrics, maintains accuracy under simulated device shifts, and automatically drafts evidence-grounded radiology reports, offering a reproducible and compute-efficient direction toward reducing the reading burden of Japanese radiologists, subject to external validation.
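The BL-TTA idea reduces to sampling N points uniformly inside the latent box (c, r) and averaging the head's predictions. A toy sketch under stated assumptions: the embedding dimension, the linear sigmoid head, and all shapes below are placeholders, not MedSAM2 internals:

```python
# Box-Latent Test-Time Augmentation, toy version: average predictions over
# N = 8 samples drawn from within the latent box. Everything is synthetic.
import numpy as np

rng = np.random.default_rng(6)
W = rng.normal(size=(14, 64))          # stand-in linear head: 14 finding labels

def predict(z):
    return 1 / (1 + np.exp(-(W @ z)))  # sigmoid over per-label logits

def bl_tta(c, r, n_samples=8):
    """Average predictions over points sampled uniformly inside the box."""
    samples = c + rng.uniform(-1, 1, size=(n_samples, c.size)) * r
    return np.mean([predict(z) for z in samples], axis=0)

c = rng.normal(size=64)                             # box center
r = np.abs(rng.normal(0.1, 0.05, 64))               # per-axis radius
print(bl_tta(c, r).round(3)[:5])                    # first 5 label probabilities
```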

20
Generalizing intensive care AI across time scales in resource-limited settings

Devadiga, A.; Singh, P.; Sankar, J.; Lodha, R.; Sethi, T.

2026-04-24 health informatics 10.64898/2026.04.23.26351588 medRxiv
Top 0.7%
3.6%

Temporal resolution of physiological monitoring in intensive care varies widely across healthcare systems. Artificial intelligence models typically assume a uniform, fixed sampling frequency, limiting their generalizability, especially to resource-limited settings. Here, we propose a novel resolution-transfer task for physiological time series and ask whether models trained on high-resolution data can generalize to a low data-density setting without the need to retrain them. SafeICU, a novel longitudinal pediatric intensive care dataset spanning ten years from a tertiary care hospital in India, was used to test this hypothesis. Self-supervised transformer models were trained on 144,271 patient-hours of high-resolution physiological signals from 984 pediatric ICU stays to learn representations of heart rate, respiratory rate, oxygen saturation, and arterial blood pressure. Transferring this model to low-resolution data established robust performance at clinically relevant lower-frequency intervals, consistently outperforming models trained directly at coarser resolutions. Further, these representations generalized across patient populations, maintaining performance when evaluated on adult intensive care cohorts from the MIMIC-III and eICU databases without retraining. In a downstream task of early shock prediction, models achieved strong discrimination in the pediatric cohort (area under the receiver operating characteristic curve (AUROC) 0.87; area under the precision-recall curve (AUPRC) 0.92) and retained stable performance across monitoring intervals from 10 to 60 minutes (AUROC 0.78-0.88). Together, these results demonstrate that physiological representations learned from high-resolution data enable time-scale-robust and transferable AI for intensive care. The publicly released SafeICU dataset, comprising longitudinal vital signs, laboratory measurements, treatment records, microbiology, and admission and discharge records, provides a foundation for developing and deploying generalizable clinical AI in resource-limited settings.
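The resolution-transfer setup amounts to evaluating a model trained on dense signals against the same signals downsampled to coarser intervals. A hedged sketch of that preprocessing step on a synthetic heart-rate trace; only the 10-60 minute interval range comes from the paper:

```python
# Downsample a dense vital-sign series to the coarser monitoring intervals
# a resolution-transfer evaluation would feed the model. Signal is synthetic.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
idx = pd.date_range("2026-01-01", periods=24 * 60, freq="min")   # 1-min vitals
hr = pd.Series(110 + 10 * np.sin(np.arange(idx.size) / 90)
               + rng.normal(0, 3, idx.size), index=idx, name="heart_rate")

for interval in ("10min", "30min", "60min"):   # intervals tested in the paper
    coarse = hr.resample(interval).mean()      # one averaged reading per window
    print(interval, "->", len(coarse), "samples, mean", round(coarse.mean(), 1))
```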